
Conversation

Contributor

@hsiang-c hsiang-c commented Jul 3, 2025

Which issue does this PR close?

Closes #1685.

Rationale for this change

When we enabled Iceberg Spark tests with Comet-enabled Spark in #1715, we implicitly loaded org.apache.comet.CometSparkSessionExtensions because Iceberg depended on the patched Spark build. This PR explicitly configures every SparkSession.Builder with .config("spark.plugins", "org.apache.spark.CometPlugin") so that we can depend on OSS Spark instead.

Thanks to @andygrove for pointing this out.

What changes are included in this PR?

  1. Depend on OSS Spark instead of a custom build with Spark patches. This saves time because we no longer need to build a custom Spark.
  2. Split the Iceberg Spark tests into 3 actions and run them in parallel. ENABLE_COMET is true for all 3 actions.
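The parallel split described in item 2 could be sketched as a GitHub Actions matrix job. This is an illustrative assumption only: the job name, module names, and the gradlew invocation below are hypothetical, not the exact workflow added in this PR.

```yaml
# Hypothetical sketch: job/module names and the test command are assumptions.
jobs:
  iceberg-spark-tests:
    strategy:
      matrix:
        module: [iceberg-spark, iceberg-spark-extensions, iceberg-spark-runtime]
    runs-on: ubuntu-latest
    env:
      ENABLE_COMET: "true"  # Comet stays enabled for all three parallel jobs
    steps:
      - uses: actions/checkout@v4
      - run: ./gradlew :${{ matrix.module }}:test
```

A matrix keeps the three suites independent, so a failure in one module does not block the others and total wall-clock time drops to the slowest suite.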

In Iceberg 1.8.1.diff, we applied two PRs from Iceberg: apache/iceberg#13786 and apache/iceberg#13793.

Additionally, we temporarily

  1. Disable spark.comet.exec.shuffle.enabled because it breaks several Iceberg Spark tests.
  2. Disable spark.comet.exec.broadcastExchange.enabled because it breaks TestRuntimeFiltering in Iceberg Spark.
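As a plain configuration sketch, the two temporary opt-outs above amount to the settings below. The property keys come from this PR; where they are set (e.g. spark-defaults.conf versus a per-test SparkConf) is an assumption.

```properties
# Temporarily disabled until the Iceberg Spark test failures are resolved
spark.comet.exec.shuffle.enabled=false
spark.comet.exec.broadcastExchange.enabled=false
```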

How are these changes tested?

  1. We enabled Comet in the Iceberg Spark tests of the iceberg-spark, iceberg-spark-extensions, and iceberg-spark-runtime modules.

@hsiang-c hsiang-c changed the title fix: [iceberg] Enable CometShuffleManager in Iceberg Spark tests fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel Jul 3, 2025
Contributor

@kazuyukitanimura kazuyukitanimura left a comment


pending CI

.config("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
.config(
SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), String.valueOf(RANDOM.nextBoolean()))
+ .config("spark.plugins", "org.apache.spark.CometPlugin")
Contributor


This makes sense but could be error-prone. If a new test creates its own SparkSession, we'd miss enabling it.
Wondering if there is a good way to update all SparkSessions at once...

Contributor Author


@kazuyukitanimura

We're lucky in some cases because TestBase and ExtensionsTestBase consolidate the SparkSession.Builder in an abstract class.

Unfortunately, other test classes and the jmh benchmarks build their own SparkSession each time :(

Contributor

@kazuyukitanimura kazuyukitanimura left a comment


pending CI

@codecov-commenter

codecov-commenter commented Jul 3, 2025

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.45%. Comparing base (f09f8af) to head (468da8a).
⚠️ Report is 395 commits behind head on main.

Files with missing lines Patch % Lines
...n/scala/org/apache/comet/rules/CometScanRule.scala 88.23% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1987      +/-   ##
============================================
+ Coverage     56.12%   58.45%   +2.32%     
- Complexity      976     1263     +287     
============================================
  Files           119      143      +24     
  Lines         11743    13212    +1469     
  Branches       2251     2360     +109     
============================================
+ Hits           6591     7723    +1132     
- Misses         4012     4264     +252     
- Partials       1140     1225      +85     


@andygrove
Member

andygrove commented Jul 8, 2025

I see that some tests are failing. I didn't run into this specific issue during my testing.

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1764.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1764.0 (TID 3312) (localhost executor driver): 
java.lang.ClassCastException: class org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to class org.apache.spark.sql.vectorized.ColumnarBatch (org.apache.spark.sql.catalyst.expressions.GenericInternalRow and org.apache.spark.sql.vectorized.ColumnarBatch are in unnamed module of loader 'app')

@hsiang-c
Contributor Author

Most of the exceptions in the Iceberg Spark SQL tests can be reproduced by:

  1. Following the official guide to build Comet and Iceberg, configure the Spark shell, and populate an Iceberg table: https://datafusion.apache.org/comet/user-guide/iceberg.html
  2. Querying an Iceberg metadata table with an operator. Here is an example:
-- default is the catalog name used in local HadoopCatalog setup
scala> spark.sql(s"SELECT COUNT(*) from default.t1.snapshots").show()

25/07/15 13:06:16 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.ClassCastException: class org.apache.iceberg.spark.source.StructInternalRow cannot be cast to class org.apache.spark.sql.vectorized.ColumnarBatch (org.apache.iceberg.spark.source.StructInternalRow is in unnamed module of loader scala.reflect.internal.util.ScalaClassLoader$URLClassLoader @19ac93d2; org.apache.spark.sql.vectorized.ColumnarBatch is in unnamed module of loader 'app')
	at org.apache.spark.sql.comet.CometBatchScanExec$$anon$1.next(CometBatchScanExec.scala:68)
	at org.apache.spark.sql.comet.CometBatchScanExec$$anon$1.next(CometBatchScanExec.scala:57)
	at org.apache.comet.CometBatchIterator.hasNext(CometBatchIterator.java:51)
	at org.apache.comet.Native.executePlan(Native Method)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2(CometExecIterator.scala:155)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2$adapted(CometExecIterator.scala:154)
	at org.apache.comet.vector.NativeUtil.getNextBatch(NativeUtil.scala:157)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1(CometExecIterator.scala:154)
	at org.apache.comet.Tracing$.withTrace(Tracing.scala:31)
	at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:152)
	at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:203)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.comet.CometBatchIterator.hasNext(CometBatchIterator.java:50)
	at org.apache.comet.Native.executePlan(Native Method)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2(CometExecIterator.scala:155)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2$adapted(CometExecIterator.scala:154)
	at org.apache.comet.vector.NativeUtil.getNextBatch(NativeUtil.scala:157)
	at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$1(CometExecIterator.scala:154)
	at org.apache.comet.Tracing$.withTrace(Tracing.scala:31)
	at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:152)
	at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:203)
	at org.apache.spark.sql.comet.execution.shuffle.CometNativeShuffleWriter.write(CometNativeShuffleWriter.scala:106)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

@parthchandra
Contributor

@hsiang-c created #2033 to track this issue

@hsiang-c hsiang-c force-pushed the enable_comet_shuffle branch from f5b7329 to 5c27644 Compare August 8, 2025 00:28
@hsiang-c hsiang-c force-pushed the enable_comet_shuffle branch from 5c27644 to 2056043 Compare August 12, 2025 20:07
Comment on lines 563 to 567
<<<<<<< Updated upstream
index 2c37a52..503dbd6 100644
=======
index 2c37a52..3442cfc 100644
>>>>>>> Stashed changes
Member


@hsiang-c It looks like a merge conflict in the diff file

Contributor Author


Sorry about that, fixed now.

@hsiang-c hsiang-c force-pushed the enable_comet_shuffle branch from a3877c3 to 2a13941 Compare August 19, 2025 16:30
Member

@andygrove andygrove left a comment


Thanks @hsiang-c!

@andygrove andygrove merged commit 3a498c4 into apache:main Aug 19, 2025
96 checks passed
@hsiang-c hsiang-c deleted the enable_comet_shuffle branch August 19, 2025 21:07